LitLin 19_4 453-475 fqh034 FIN

Author

  • David L. Hoover
Abstract

Delta, a simple measure of the difference between two texts, has been proposed by John F. Burrows as a tool in authorship attribution problems, particularly in large ‘open’ problems in which conventional methods of attribution are not able to limit the claimants effectively. This paper tests Delta’s effectiveness and accuracy, and shows that it works nearly as well on prose as it does on poetry. It also shows that much larger numbers of frequent words are even more accurate than the 150 that Burrows tested. Automated methods that allow for tests on large numbers of differently selected words show that removing personal pronouns and words for which a single text supplies most of the occurrences greatly increases the accuracy of Delta tests. Further tests suggest that large changes in Delta and Delta z-scores from the likeliest to the second likeliest author typically characterize correct attributions, that differences in point of view among the texts are more significant than differences in nationality, and that combining several texts for each author in the primary set reduces the effect of intra-author variability. Although Delta occasionally produces errors in attribution with characteristics that would normally lead to a great deal of confidence, the results presented here confirm its usefulness in the preliminary stages of authorship attribution problems.
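The abstract does not spell out how Delta is computed. As a rough illustration only, the sketch below follows the commonly cited definition of Burrows's Delta: the mean absolute difference between the z-scores of the most frequent words in a test text and in each candidate author's sample. The function names, the `exclude` parameter (standing in for the removal of personal pronouns), and the choice to pool one token list per author are assumptions made for this example, not the procedure used in the paper.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(tokens, vocab):
    """Relative frequency (per token) of each vocabulary word in one text."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]


def burrows_delta(test_tokens, candidates, n_words=150, exclude=frozenset()):
    """Return {author: Delta} for a test text against candidate samples.

    candidates: dict mapping author name -> list of tokens (hypothetical
    stand-in for the primary set, one pooled sample per author).
    exclude: words dropped before ranking, e.g. personal pronouns.
    """
    # Rank words by total frequency across all candidate samples.
    pooled = Counter()
    for tokens in candidates.values():
        pooled.update(tokens)
    vocab = [w for w, _ in pooled.most_common() if w not in exclude][:n_words]

    # Per-author relative frequencies, then mean and spread of each word.
    profiles = {a: relative_freqs(t, vocab) for a, t in candidates.items()}
    columns = list(zip(*profiles.values()))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]

    def zscores(freqs):
        # Words with zero spread carry no information; score them as 0.
        return [(f - m) / s if s > 0 else 0.0
                for f, m, s in zip(freqs, means, stds)]

    test_z = zscores(relative_freqs(test_tokens, vocab))
    # Delta: mean absolute z-score difference; lower means more similar.
    return {
        a: mean(abs(tz - az) for tz, az in zip(test_z, zscores(p)))
        for a, p in profiles.items()
    }


# Toy usage with made-up samples; real tests would use full texts.
samples = {
    "A": "the cat and the dog and the bird".split(),
    "B": "a cat or a dog or a bird".split(),
}
scores = burrows_delta("the fox and the hen".split(), samples, n_words=5)
likeliest = min(scores, key=scores.get)
```

The candidate with the lowest Delta is the likeliest author under this measure. Sorting the scores and comparing the two smallest values gives a simple view of the gap between the likeliest and second likeliest author that the abstract associates with correct attributions.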


Publication date: 2004